Investigating the user journey on an ecommerce site by Jaume Clave
June 10th, 2020
Website tracking is when websites collect information about site users to monitor their online behavior. It happens on virtually every website today. Free services like Google Analytics and Google Tag Manager, as well as sophisticated products such as Adobe Analytics, help track users and process the collected data in ways that are quickly digestible and understandable by business owners and decision makers.
Websites collect a vast array of data for many different uses. This includes data you provide via forms, for example, email address and credit card information, as well as many other types of information gained from tracking technology.
Some of the data points websites collect include:
- IP addresses to determine a user’s location.
- Information about how the user interacts with websites. For example, what they click on and how long they spend on a page.
- Information about the browser and the device the user accesses the site with.
- Browsing activity across different sites. This gives those with access to the information insight into the individual user’s interests, shopping habits, problems they are facing, and more.
Not all websites collect all the above data. Some don’t collect any data at all. It all depends on the service the website provides as well as how the site is monetized. These data points may also be used to optimise the user journey: in other words, to change the website layout based on data-driven decisions in order to facilitate the end goal, which in an ecommerce setting is a product purchase.
This project takes a look at anonymized data from an ecommerce site. The goal is to analyze the data to understand how customers behave and purchase on the site so that management may begin plans to restructure the site in hopes of increasing purchases. The project will capture and show the customer journeys between the various elements (pages).
The Data
i. Clickstream Feed
ii. Products Feed
iii. Registered User Feed
iv. Merging the Data Frames
Markov Chains
i. Transition States, Probabilities and Matrix
ii. Heat Map
iii. Removal Effects
iv. Visualising the Markov Chain
Sankey Diagram
i. Data Processing
ii. Plotly’s Sankey Data Structure
iii. Visualising Sankey User Flow With Plotly
iv. Last Page Event Sankey
NetworkX Graph
i. Graph Layouts
ii. Network Analytics to Visualise Website Structure
This section loads the necessary Python modules needed to begin exploring and loading the data into the Jupyter Notebook. There are three files loaded into the Notebook, all .tsv files. Each file is explained before it is loaded, organized and displayed.
## Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
The 'clickstream-feed-generated.tsv' is a file that has been provided by an ecommerce website in the United States. This data set contains the clickstream id, a timestamp that indicates when a page was initially loaded, the IP address, the URL accessed, if the customer ended up making a purchase on that action, the user_session_id which is assigned to each unique user session, the city from which the website was accessed and its respective state.
A .tsv file or "tab-separated values" file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. The data is loaded with the pandas read_csv function with a pipe separator in order to format it into a readable pandas data frame.
## Loading clickstream data
data = pd.read_csv('clickstream-feed-generated.tsv',
                   sep = "|",
                   names = ['clickstream_id', 'timestamp', 'IP_address', 'url',
                            'is_purchased', 'is_page_errored', 'user_session_id',
                            'city', 'state', 'country'],
                   dtype = {'clickstream_id': str, 'IP_address': str,
                            'user_session_id': str, 'AcqDate': str},
                   parse_dates = ['timestamp'],
                   infer_datetime_format = True,
                   dayfirst = True,
                   encoding = 'utf-8')
cs_df = data.copy()
## View df
print(f'The clickstream data set has {cs_df.shape[0]} rows and {cs_df.shape[1]} columns')
cs_df.head()
The 'products.tsv' forms a data frame that shows the url, the category of the page and its respective ID. This data frame will be used when merging on the URLs. This will allow for easier examination of a user’s website path.
## Loading products
data = pd.read_csv('products.tsv',
                   sep = r"\s+",
                   dtype = {'id': str},
                   encoding = 'utf-8')
products_df = data.copy()
## Products dataframe
print(f'The products data set has {products_df.shape[0]} rows and {products_df.shape[1]} columns')
## Rearranging data
NanIndex = pd.isna(products_df['id'])
products_df.loc[NanIndex,['id']] = products_df.loc[NanIndex,['category']].values
products_df.loc[NanIndex,['category']] = products_df.loc[NanIndex,['url']].values
products_df.loc[~NanIndex,['category']] = products_df.loc[~NanIndex,['url']].values + " " + \
products_df.loc[~NanIndex,['category']].values
products_df['url'] = products_df.index
products_df['id'] = products_df['id'].astype(str)
products_df.set_index('id', drop = True, inplace = True)
products_df.head(16)
The 'regusers.tsv' data frame holds very important data on the website’s registered users. There are some errors in the data fields, which is unfortunate; however, the gender is revealed, which might allow us to explore possible differences in purchasing trends.
## Registered user
data = pd.read_csv('regusers.tsv',
                   sep = "\t",
                   index_col = 'SWID',
                   engine = 'python',
                   parse_dates = ['BIRTH_DT'],
                   dayfirst = True,
                   skipfooter = 2,
                   encoding = 'utf-8')
reguser_df = data.copy()
## Registered user dataframe
print(f'The registered users data set has {reguser_df.shape[0]} rows and {reguser_df.shape[1]} columns')
reguser_df.head()
All three dataframes will be merged in order to create a master DF that contains complete information about a user’s website behavior.
## Products full URL
products_df['full_url'] = 'http://www.RL.com' + products_df['url']
products_df.head()
## Merge
analysis_df = pd.merge(left = cs_df, right = products_df, how = 'left', left_on = 'url', right_on = 'full_url')
analysis_df.drop(['url_x', 'url_y'], axis = 1, inplace = True)
analysis_df.head()
## Create visit order
analysis_df.sort_values(['user_session_id', 'timestamp'], ascending = [False, True], inplace = True)
analysis_df['visit_order'] = analysis_df.groupby('user_session_id').cumcount() + 1
A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. These events are characterized by the "memorylessness" property of certain probability distributions. It is a process for which predictions can be made regarding future outcomes based solely on its present state and, most importantly, such predictions are just as good as the ones that could be made knowing the process's full history. Conditional on the present state of the system, its future and past states are independent.
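Memorylessness can be sketched with a toy two-state chain (hypothetical pages, not the project's data): the distribution over the next page is read off a single row of the transition matrix, regardless of how the user arrived at the current page.

```python
import numpy as np

# Toy transition matrix P for two hypothetical pages.
P = np.array([[0.2, 0.8],   # from "home":    P(home)=0.2, P(product)=0.8
              [0.6, 0.4]])  # from "product": P(home)=0.6, P(product)=0.4

current = np.array([1.0, 0.0])   # the user is on "home" right now
next_dist = current @ P          # distribution over the next page
print(next_dist)                 # [0.2 0.8]
```

The full browsing history never enters the calculation; only `current` does.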
In this situation Markov chains can provide a framework to statistically model user journeys on the site and how each page factors into the users traveling from one page to another to eventually purchase (or not). The transition probabilities between states (pages) can be calculated and using these probabilities the statistical impact of a single page has on the total amount of purchases can be identified.
The core concepts of Markov chains can be used here with the clickstream data to identify the probabilities of moving from one event (page) to another in our network of potential website pages and purchase events. This section does just that: initially, the data needs to be preprocessed in order to show the complete user journey on the website. From there, the transition probabilities can be computed using the transition states and the distinct paths. Finally, the transition matrix will be calculated and used as the basis of a heat map which visually displays which pages are most likely to be travelled to when a user is on a certain page.
## Create customer path
paths_df = analysis_df.groupby('user_session_id')['category'].aggregate(
    lambda x: x.unique().tolist()).reset_index()
## Merge both DFs
last_interaction_df = analysis_df.drop_duplicates('user_session_id', keep = 'last')[['user_session_id', 'is_purchased']]
paths_df = pd.merge(paths_df, last_interaction_df, how = 'left', on = 'user_session_id')
## Add start and end to journey
paths_df['paths'] = np.where(
    paths_df['is_purchased'] == 0,
    'Start,' + paths_df['category'].apply(','.join) + ',Null',
    'Start,' + paths_df['category'].apply(','.join) + ',Purchase')
paths_df['paths'] = paths_df['paths'].str.split(',')
The data frame below is in wide-form. It is a data frame containing a single row per user and the total chronological user-journey in a list of touch-points.
This has been done by first grouping the chronological touch-points into a list, merging the list of final conversion/non-conversion events onto that data frame and finally adding a “Null” or “Purchase” event to the end of our user-journey lists.
## Save to .csv
paths_df.to_csv('customer_journey.csv')
paths_df.head()
Initially, a list of all user paths needs to be created. Similarly, the conversion rate can be calculated from the total purchase count. This will all be used later on to help define the attribution.
## Get paths/journeys
list_of_paths = paths_df['paths']
total_purchases = sum(path.count('Purchase') for path in paths_df['paths'].tolist())
base_conv_rate = total_purchases / len(list_of_paths)
The function below identifies all potential state transitions and outputs a dictionary with its contents.
## Function to calculate potential state transitions
def transition_state(list_of_paths):
    list_of_unique_paths = set(x for element in list_of_paths for x in element)
    transition_state = {x + '>' + y: 0 for x in list_of_unique_paths for y in list_of_unique_paths}
    for possible_state in list_of_unique_paths:
        if possible_state not in ['Purchase', 'Null']:
            for user_path in list_of_paths:
                indices = [i for i, s in enumerate(user_path) if possible_state == s]
                for col in indices:
                    transition_state[user_path[col] + '>' + user_path[col + 1]] += 1
    return transition_state
## Calling transition_state function
trans_states = transition_state(list_of_paths)
The function below uses a defaultdict to calculate the transition probabilities. A defaultdict works exactly like a normal dict, but it is initialized with a function (a “default factory”) that takes no arguments and provides the default value for a nonexistent key. A defaultdict never raises a KeyError: instead of an error being thrown for a missing key, a new entry is created with the value returned by the default factory.
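A quick sketch of that behaviour, counting toy transition pairs (hypothetical data, not the site's):

```python
from collections import defaultdict

# Default factory int() returns 0, so counting needs no KeyError handling.
counts = defaultdict(int)
for pair in ['home>product', 'home>product', 'product>Purchase']:
    counts[pair] += 1            # first access of a key creates it at 0
print(dict(counts))              # {'home>product': 2, 'product>Purchase': 1}
```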
## Function for actual transition states
from collections import defaultdict
def transition_prob(trans_dict):
    list_of_unique_channels = set(x for element in list_of_paths for x in element)
    trans_prob = defaultdict(dict)
    for state in list_of_unique_channels:
        if state not in ['Purchase', 'Null']:
            counter = 0
            index = [i for i, s in enumerate(trans_dict) if state + '>' in s]
            for col in index:
                if trans_dict[list(trans_dict)[col]] > 0:
                    counter += trans_dict[list(trans_dict)[col]]
            for col in index:
                if trans_dict[list(trans_dict)[col]] > 0:
                    state_prob = float(trans_dict[list(trans_dict)[col]]) / float(counter)
                    trans_prob[list(trans_dict)[col]] = state_prob
    return trans_prob
## Calling transition_prob function
trans_prob = transition_prob(trans_states)
A stochastic matrix (or transition matrix) is a square matrix that is used to describe the transitions of a Markov chain. Each of its entries is a nonnegative real number representing a probability. The matrix is used to tally the transition probabilities. Every state in the state space is included once as a row and again as a column, and each cell in the matrix tells you the probability of transitioning from its row's state to its column's state. If the state space adds one state, we add one row and one column, adding one cell to every existing column and row. This means the number of cells grows quadratically as we add states to our Markov chain.
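A minimal sketch of such a matrix, using a hypothetical three-page state space: each row is a probability distribution over next states, so every row must sum to 1, and an absorbing state like 'Purchase' keeps all of its probability on itself.

```python
import numpy as np
import pandas as pd

states = ['home', 'product', 'Purchase']
P = pd.DataFrame([[0.1, 0.7, 0.2],
                  [0.3, 0.2, 0.5],
                  [0.0, 0.0, 1.0]],   # 'Purchase' is absorbing
                 index=states, columns=states)
assert np.allclose(P.sum(axis=1), 1.0)  # every row sums to 1
```

Adding a fourth state would add one row and one column, which is why the cell count grows quadratically with the number of states.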
The function below helps in converting the transition probabilities dictionary into a matrix. The matrix will be stored in a pandas data frame.
## Function for transition matrix
def transition_matrix(list_of_paths, transition_probabilities):
    trans_matrix = pd.DataFrame()
    list_of_unique_paths = set(x for element in list_of_paths for x in element)
    for path in list_of_unique_paths:
        trans_matrix[path] = 0.00
        trans_matrix.loc[path] = 0.00
        trans_matrix.loc[path, path] = 1.0 if path in ['Purchase', 'Null'] else 0.0
    for key, value in transition_probabilities.items():
        origin, destination = key.split('>')
        trans_matrix.at[origin, destination] = value
    return trans_matrix
## Calling transition_matrix function
trans_matrix = transition_matrix(list_of_paths, trans_prob)
## Format trans_matrix
reorderlist = ['Start', 'celebrity recommendation', 'customer review', 'home page', 'product',
               'video review', 'Null', 'Purchase']
trans_matrix = trans_matrix[reorderlist]
trans_matrix = trans_matrix.reindex(reorderlist)
This heatmap helps quickly spot trends and associations between two variables. In this case – which types of pages are users most likely to click to from a given page (or page type) on the site.
This heat map shows several things:
1. The home page is the most likely page a user will land on when beginning their journey. This is determined because 'Start' (signifying the start of a session) and 'home page' are linked with the highest score (0.35). A 'product' page is quite a popular landing page as well. This might be because the company is running a Google Ads campaign and linking its ads to product pages.
2. From a 'celebrity recommendation' the most common next action is to leave the site completely.
3. The highest probability of the next page being a 'Purchase' is when the current page is 'celebrity recommendation'.
4. A video review page view is most likely to lead to a site exit, while a product page view is most likely to lead to a visit back to the home page.
Using historical context and the heat map below we gain insights into how each website page is driving users towards the purchase event, and we also gain critical information about how the website pages interact with each other. Given today’s typical multi-touch and multi-page website, this information can prove extremely valuable and can allow the business to optimize the website and its pages to guide the customer towards a purchase.
## Plotly Heatmap
import plotly.graph_objects as go
fig = go.Figure(data = go.Heatmap(
    z = trans_matrix.values.tolist(),
    x = reorderlist,
    y = reorderlist,
    hoverongaps = False,
    colorscale = 'Portland'))
fig.update_layout(template = "plotly_dark",
                  title = "Next Page Viewed",
                  xaxis_title = "Next Page",
                  yaxis_title = "Current Page",
                  width = 900,
                  height = 900)
fig.show()
The following section will iteratively go through each of the website pages and assess the impact it would have on overall purchases if the page were removed from the state-space (website). This will be calculated and the resulting removal effects added to an output dictionary.
The removal effect is defined as the percentage of conversions we'd miss out on if a given channel or tactic was removed from the system. In other words, if we create one new model for each page where that page is set to 100% no purchases, we will have a new model that highlights the effect that removing that page entirely had on the overall system.
Mathematically speaking, we'd be taking the percent difference in the purchase rate of the overall system with a given page set to NULL against the purchase rate of the overall system. We would do this for each page. Then we create a weighting for each of them based on the sum of removal effects, and finally we multiply that number by the number of purchases to arrive at the fractionally attributed number of purchases.
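The purchase rate of a chain like this can be computed with the absorbing-chain "fundamental matrix". A toy illustration with hypothetical numbers and a single transient state: Q holds transient-to-transient probabilities, R holds transient-to-absorbing probabilities, and the absorption probabilities are B = (I - Q)^-1 R.

```python
import numpy as np

Q = np.array([[0.2]])            # home -> home
R = np.array([[0.5, 0.3]])       # home -> Null, home -> Purchase
N = np.linalg.inv(np.eye(1) - Q) # fundamental matrix (I - Q)^-1
B = N @ R                        # absorption probabilities from 'home'
print(B[0, 1])                   # P(eventually Purchase | start at home) ≈ 0.375
```

The same inverse-and-multiply appears in the removal_effects function below, once per page removed, with 'Start' as the transient state of interest.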
## removal_effects function
def removal_effects(df, conversion_rate):
    removal_effects_dict = {}
    channels = [channel for channel in df.columns if channel not in ['Start', 'Null', 'Purchase']]
    for channel in channels:
        removal_df = df.drop(channel, axis=1).drop(channel, axis=0)
        for column in removal_df.columns:
            row_sum = np.sum(list(removal_df.loc[column]))
            null_pct = float(1) - row_sum
            if null_pct != 0:
                removal_df.loc[column, 'Null'] = null_pct
            removal_df.loc['Null', 'Null'] = 1.0
        removal_to_conv = removal_df[['Null', 'Purchase']].drop(['Null', 'Purchase'], axis=0)
        removal_to_non_conv = removal_df.drop(['Null', 'Purchase'], axis=1).drop(['Null', 'Purchase'], axis=0)
        removal_inv_diff = np.linalg.inv(
            np.identity(len(removal_to_non_conv.columns)) - np.asarray(removal_to_non_conv))
        removal_dot_prod = np.dot(removal_inv_diff, np.asarray(removal_to_conv))
        removal_cvr = pd.DataFrame(removal_dot_prod,
                                   index=removal_to_conv.index)[[1]].loc['Start'].values[0]
        removal_effect = 1 - removal_cvr / conversion_rate
        removal_effects_dict[channel] = removal_effect
    return removal_effects_dict
## Calling the removal effects function
removal_effects_dict = removal_effects(trans_matrix, base_conv_rate)
The removal_effects_dict is directly used to calculate the Markov chain attributions for each of the actual pages.
## Markov chain allocations function
def markov_chain_allocations(removal_effects, total_purchases):
    re_sum = np.sum(list(removal_effects.values()))
    return {k: (v / re_sum) * total_purchases for k, v in removal_effects.items()}
attributions = markov_chain_allocations(removal_effects_dict, total_purchases)
## Create DF from attributions
attribution_df = pd.DataFrame.from_dict(attributions, orient = 'index')
attribution_df.reset_index(inplace = True)
attribution_df.columns = ['page', 'attribution count']
## Plotly bar chart
import plotly.express as px
fig = px.bar(attribution_df, x = 'page', y = 'attribution count', color = 'page',
             color_discrete_sequence = px.colors.qualitative.Plotly)
fig.update_layout(template = "plotly_dark",
                  title = "Purchases Attributed to Each Page",
                  xaxis_title = "Page",
                  yaxis_title = "Purchase Attribution Count",
                  width = 900,
                  height = 700,
                  showlegend = False)
fig.show()
The current website and system we are studying does not have many states: there are 5 page states, one for each website page. If we include the 'Start', 'Null' and 'Purchase' states, the system has a total of 8.
The markovclick package is a Python implementation of the R package clickstream which models website clickstreams as Markov chains. markovclick allows us to model clickstream data from websites as Markov chains, which can then be used to predict the next likely click on a website for a user, given their history and current state. This will be used to visualise the chain for our website.
The tqdm package will be used to visualise the progress of the for loop needed to create the list of user paths. tqdm derives from the Arabic word taqaddum (تقدّم) which can mean “progress,” and is an abbreviation for “I love you so much” in Spanish (te quiero demasiado). It instantly makes loops show a smart progress meter by wrapping any iterable with tqdm(iterable).
## Import modules
from tqdm import tqdm_notebook as tqdm
from markovclick import dummy
from markovclick.viz import visualise_markov_chain
from markovclick.models import MarkovClickstream
import os
## Journey/Path list
journey_list = []
session = set(analysis_df["user_session_id"])
for i in tqdm(session):
    session_set = analysis_df[analysis_df["user_session_id"] == i]
    session_journey = list(np.array(session_set["category"]))
    journey_list.append(session_journey)
## View model
os.environ["PATH"] += os.pathsep + 'D:/anononda/Library/bin/graphviz-2.38/release/bin'
model = MarkovClickstream(journey_list)
graph = visualise_markov_chain(model)
graph
The transition probabilities are different to the ones on the heat map displayed above because these calculations only include the five main website pages. This is the Markov chain: it shows the states along with the probabilities of changing state and each state's self-looping probability.
A sankey diagram is a visualisation used to depict a flow from one set of values to another. The things being connected are called nodes and the connections are called links. Sankeys are best used when you want to show a many-to-many mapping between two domains or multiple paths through a set of stages, perfect in this case to show how website traffic flows from page to page on the ecommerce site.
Sankey diagrams can display different data types over two dimensions: the nodes can convey the quantity and chronological order of website events, and the width of the links can display the proportion of users who moved from one specific website page to another.
When it comes to user journey analysis, these diagrams allow us to identify at a glance the most frequent page events, in what order they occur, and the different paths from page A to page B. This is information that marketers, decision makers and website developers are likely to be interested in.
## Function to add 2 minutes to purchase event
import datetime as dt
def add_two_mins(time_string):
    new_time = time_string + dt.timedelta(minutes=2)
    return new_time.strftime('%H:%M:%S')
## Append purchase event to analysis_df
purchase_df = analysis_df[analysis_df['is_purchased'] == 1].copy()
purchase_df['timestamp'] = purchase_df['timestamp'].map(add_two_mins)
purchase_df['category'] = 'purchase'
analysis_df = pd.concat([analysis_df, purchase_df], axis = 0)
analysis_df['timestamp'] = pd.to_datetime(analysis_df['timestamp'])
The sankey diagram will contain the following:
- Every website event of the user journey: the nodes of the chart will stand for all the website pages, from entrance to the Nth event. Here, we’ll go up to the 7th event.
- The funnel of pages of every user: each time we see that a user visited a new page, we’ll link this page to their previous page. The width of the links will increase as we observe more users completing the same sequence of events.
- Average time between two events: we’ll also compute the time between every event for each user, to be able to compute the average time between each step of the funnel.
In order to do this the data must be processed, organized and cleansed to meet the sankey diagram structure. The code from this section has been recycled from a fantastic article by Nicolas Esnis explaining the use of sankey diagrams and their use for both app developers and marketers.
## Create initial DF from analysis_df
entrances = analysis_df[['user_session_id', 'timestamp']].sort_values('timestamp').drop_duplicates('user_session_id')
## Add columns with initial ('entrance') event name/type
entrances['event_name'] = 'entrance'
entrances['event_type'] = 'entrance'
entrances.rename(columns={'timestamp': 'time_event'}, inplace=True)
## Drop duplicates with new DF
data = analysis_df[['user_session_id', 'category', 'timestamp']].drop_duplicates()
data.rename(columns={'timestamp': 'time_event', 'category': 'event_name'}, inplace=True)
data['event_type'] = 'in_app_action'
## Concatenate both DFs
data = pd.concat([data, entrances[data.columns]])
data.sort_values(['user_session_id', 'event_type', 'time_event'], ascending = [True, True, True], inplace = True)
grouped = data.groupby('user_session_id')
## Function to rank events chronologically
def rank(x): return x['time_event'].rank(method='first').astype(int)
data["rank_event"] = grouped.apply(rank).reset_index(0, drop=True)
grouped = data.groupby('user_session_id')
## Function to get next event
def get_next_event(x): return x['event_name'].shift(-1)
data["next_event"] = grouped.apply(lambda x: get_next_event(x)).reset_index(0, drop=True)
grouped = data.groupby('user_session_id')
## Function to get time difference between events
def get_time_diff(x): return x['time_event'].shift(-1) - x['time_event']
data["time_to_next"] = grouped.apply(lambda x: get_time_diff(x)).reset_index(0, drop=True)
## Condition to limit to the first 7 events
data = data[data.rank_event < 8]
data[data['rank_event'] == 1].event_name.unique()
The data frame now has the user_session_id, the event_name, the time and type of that event, the rank (order) of that event, the next_event and the time to the next event. It is previewed below.
## Preview DF
data.head()
Plotly’s Sankey data structure is made of two Python dictionaries: node and link. The link dict takes four Python lists as parameters:
- source: a list of each flow’s source node. To have a source per event flow, we’ll need to append a unique index per event and per event rank to this list.
- target: a list of target nodes. We’ll map every source event to its target event, using the next_event column.
- value: a list containing each flow’s volume information. We’ll keep count of every time we map a source to a target and append this count to the value list.
- label: a list of metadata that will be displayed when hovering each link. Here, we’ll want to know the number of users going from a source event to a target, and the average time it took them to do so.
The node dict takes the two following parameters:
- label: a list containing the nodes’ names (in our case, the events’ names);
- color: an optional list containing the nodes’ color information.
A unique source_index for each event at each step of the user journey is created, keeping track of every page's name.
The first level keys are the rank of each event. Rank #1 only contains 'entrance' as a source. We store a unique source_index rather than the pages’ names, since the events can occur multiple times within the journey. We also attribute a unique color to each page, and name this dict nodes_dict.
# Working on the nodes_dict
all_events = list(data.event_name.unique())
# Create a set of colors that you'd like to use in your plot.
palette = ['50BE97', 'E4655C', 'FCC865',
           'BFD6DE', '3E5066', '353A3E', 'E6E6E6']
# Here, I passed the colors as HEX, but we need to pass them as RGB. This loop converts from HEX to RGB:
for i, col in enumerate(palette):
    palette[i] = tuple(int(col[j:j+2], 16) for j in (0, 2, 4))
# Append a Seaborn complementary palette to your palette in case you did not provide enough colors to style every event
complementary_palette = sns.color_palette("deep", len(all_events) - len(palette))
if len(complementary_palette) > 0:
    palette.extend(complementary_palette)
output = dict()
output.update({'nodes_dict': dict()})
i = 0
for rank_event in data.rank_event.unique():  # For each rank of event...
    # Create a new key equal to the rank...
    output['nodes_dict'].update({rank_event: dict()})
    # Look at all the events that were done at this step of the funnel...
    all_events_at_this_rank = data[data.rank_event == rank_event].event_name.unique()
    # Read the colors for these events and store them in a list...
    rank_palette = []
    for event in all_events_at_this_rank:
        rank_palette.append(palette[list(all_events).index(event)])
    # Keep track of the events' names, colors and indices.
    output['nodes_dict'][rank_event].update(
        {
            'sources': list(all_events_at_this_rank),
            'color': rank_palette,
            'sources_index': list(range(i, i + len(all_events_at_this_rank)))
        }
    )
    # Finally, increment by the length of this rank's available events to make sure
    # the next indices will not be chosen from existing ones
    i += len(output['nodes_dict'][rank_event]['sources_index'])
For every user’s sequence of events, we’ll need to:
1) Read in nodes_dict the unique source_index of every event in the sequence.
2) Likewise, read the source_index of each event’s next event (the target indices of rank N are retrieved from the source indices of rank N+1) and store it into a target_index variable.
3) Check if the combination of the source_index and the target_index is already a key of links_dict. If not, we’ll create it. If it is, we’ll increment the count of unique users, and add the time_to_next information. Later, by dividing the time_to_next by the count of unique users, we’ll have the average time from an event to another.
# Working on the links_dict
output.update({'links_dict': dict()})
# Group the DataFrame by user_session_id and rank_event
grouped = data.groupby(['user_session_id', 'rank_event'])
# Define a function to read the sources, targets, values and time from event to next_event:
def update_source_target(user):
    try:
        # user.name[0] is the user's user_session_id; user.name[1] is the rank of each action
        # 1st we retrieve the source and target's indices from nodes_dict
        source_index = output['nodes_dict'][user.name[1]]['sources_index'][
            output['nodes_dict'][user.name[1]]['sources'].index(user['event_name'].values[0])]
        target_index = output['nodes_dict'][user.name[1] + 1]['sources_index'][
            output['nodes_dict'][user.name[1] + 1]['sources'].index(user['next_event'].values[0])]
        # If this source is already in links_dict...
        if source_index in output['links_dict']:
            # ...and if this target is already associated to this source...
            if target_index in output['links_dict'][source_index]:
                # ...then we increment the count of users with this source/target pair by 1,
                # and keep track of the time from source to target
                output['links_dict'][source_index][target_index]['unique_users'] += 1
                output['links_dict'][source_index][target_index]['avg_time_to_next'] += user['time_to_next'].values[0]
            # ...but if the target is not already associated to this source...
            else:
                # ...we create a new key for this target, for this source, and initiate it with 1 user
                # and the time from source to target
                output['links_dict'][source_index].update({target_index: dict(
                    {'unique_users': 1,
                     'avg_time_to_next': user['time_to_next'].values[0]})})
        # ...but if this source isn't already available in the links_dict, we create its key
        # and the key of this source's target, and we initiate it with 1 user and the time from source to target
        else:
            output['links_dict'].update({source_index: dict({target_index: dict(
                {'unique_users': 1, 'avg_time_to_next': user['time_to_next'].values[0]})})})
    except Exception:
        # The last event of a journey has no next event, so the lookup above fails; skip it
        pass
# Apply the function to your grouped Pandas object:
grouped.apply(lambda user: update_source_target(user))
Before being able to plot our Sankey diagram, we’ll need to create the targets, sources, values, labels and colors lists from our dictionaries, which will be passed as parameters to the plotting function. This can be achieved easily by iterating over our nodes_dict and links_dict:
targets = []
sources = []
values = []
time_to_next = []
for source_key, source_value in output['links_dict'].items():
    for target_key, target_value in output['links_dict'][source_key].items():
        sources.append(source_key)
        targets.append(target_key)
        values.append(target_value['unique_users'])
        # Split to remove the milliseconds information
        time_to_next.append(str(pd.to_timedelta(
            target_value['avg_time_to_next'] / target_value['unique_users'])).split('.')[0])
labels = []
colors = []
for key, value in output['nodes_dict'].items():
    labels = labels + list(output['nodes_dict'][key]['sources'])
    colors = colors + list(output['nodes_dict'][key]['color'])
for idx, color in enumerate(colors):
    colors[idx] = "rgb" + str(color)
The sankey diagram is a fantastic way to visualise flow through different states. The sankey above visualizes the first 7 actions (pages) users take on the site. Most users leave the site before visiting 7 different pages, while others continue to view more pages or purchase a product. If those purchases or page views are the 8th of higher event on their journey it is not represented above. They could be represented by tweaking the code that was used to generate the plot however, for sake of simplicity and clean visualisations the the journey has been limited to 7 pages.
In the representation below, all visitor journeys begin at the 'entrance' state. From there, visitors split across one of 5 pages: the home page, a customer review page, a video review page, a celebrity recommendation page or a product page. At each step users can continue to other pages on the site, make a purchase, or leave. The sankey visualises just that: it shows the drop-off rate and the continuation flow.
import plotly.graph_objects as go
import plotly.offline as py
import plotly.tools as tls
import plotly
py.init_notebook_mode(connected = True)
fig = go.Figure(data = [go.Sankey(
node = dict(
thickness = 10, # default is 20
line = dict(color = "black", width = 0.5),
label = labels,
color = colors
),
visible = True,
link=dict(
source = sources,
target = targets,
value = values,
label = time_to_next,
hovertemplate = '%{value} unique users went from %{source.label} to %{target.label}.<br />' +
'<br />It took them %{label} on average.<extra></extra>',
))])
fig.update_layout(title_text="Website User Flow", plot_bgcolor = 'white', autosize = False,
width = 1500,
height = 1000,
showlegend=True
)
py.iplot(fig, filename = "test-graph")
This section aims to clarify and visualise what might be lost in the large quantity of data analyzed in the previous sankey. The sankey visualised below shows the flow for users either exiting the site or purchasing a product, from the last page viewed before one of those two events occurs. The aim here is to see which pages directly lead to the most purchases and the most exits. From this, a decision maker can begin to form an understanding of how a page may impact a user's browsing and purchasing behavior.
## Function for actual transition states
from collections import defaultdict
def transition_count(trans_dict):
    # Note: list_of_paths is assumed to be defined globally earlier in the notebook
    list_of_unique_channels = set(x for element in list_of_paths for x in element)
    trans_count = defaultdict(dict)
    for state in list_of_unique_channels:
        if state not in ['Purchase', 'Null']:
            counter = 0
            index = [i for i, s in enumerate(trans_dict) if state + '>' in s]
            for col in index:
                if trans_dict[list(trans_dict)[col]] > 0:
                    counter += trans_dict[list(trans_dict)[col]]
            for col in index:
                if trans_dict[list(trans_dict)[col]] > 0:
                    state_prob = float(trans_dict[list(trans_dict)[col]])
                    trans_count[list(trans_dict)[col]] = state_prob
    return trans_count, list_of_unique_channels
## Get transition_counts
trans_count = transition_count(trans_states)
## Create DF for analysis
path_count_df = pd.DataFrame.from_dict(trans_count[0], orient = 'index')
path_count_df.reset_index(inplace = True)
path_count_df.columns = ['path', 'count']
path_count_df[['source', 'target']] = path_count_df['path'].str.split('>', expand = True)
path_count_df.drop('path', axis =1, inplace = True)
Below, the source and target pages are mapped to categorical codes, which the plotly object uses to create the sankey flow.
## Map page names to categorical values
source_codes = {'Start' : 0}
target_codes = {'customer review' : 1, 'home page' : 2, 'celebrity recommendation': 3, 'product' : 4, 'video review' : 5}
path_count_df['source_#'] = path_count_df['source']
path_count_df['source_#'] = path_count_df['source_#'].map(source_codes)
path_count_df['target_#'] = path_count_df['target']
path_count_df['target_#'] = path_count_df['target_#'].map(target_codes)
path_count_df = path_count_df[['source', 'target', 'source_#', 'target_#', 'count']]
path_count_df['path_name'] = path_count_df['source'] + ' -> ' + path_count_df['target']
The data frame below contains the source and target pages along with the categorical values that will be used in the sankey plot. The targets are either 'Null', representing a direct exit from the website, or 'Purchase', which signifies that the user completed a purchase. Naturally, there are many more 'Null' exits than there are purchases.
## Map page names to categorical values
source_codes = {'customer review' : 1, 'home page' : 2, 'celebrity recommendation': 3, 'product' : 4, 'video review' : 5}
target_codes = {'Purchase' : 6, 'Null' : 7}
purch_df = path_count_df.loc[(path_count_df['target'] == 'Purchase') | (path_count_df['target'] == 'Null')].copy()
purch_df['source_#'] = purch_df['source'].map(source_codes)
purch_df['target_#'] = purch_df['target'].map(target_codes)
purch_df
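Alongside the sankey, a per-page purchase rate can be computed directly from a frame shaped like purch_df. The sketch below mirrors the column names above ('source', 'target', 'count'), but the counts themselves are made up for illustration:

```python
import pandas as pd

# Illustrative counts only; real values come from purch_df above
df = pd.DataFrame({
    'source': ['home page', 'home page', 'product', 'product'],
    'target': ['Purchase', 'Null', 'Purchase', 'Null'],
    'count':  [120, 2400, 310, 900],
})

# Pivot to one row per page, then compute the direct purchase rate
rates = df.pivot_table(index = 'source', columns = 'target',
                       values = 'count', fill_value = 0)
rates['purchase_rate'] = rates['Purchase'] / (rates['Purchase'] + rates['Null'])
print(rates.sort_values('purchase_rate', ascending = False))
```

Ranking pages this way gives a quick numeric companion to the visual flow.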
This sankey diagram shows the user flow from the last page viewed before a purchase or an exit. Unsurprisingly, the exits, represented by 'Null', are significantly more probable than the purchases. Each of the flows may be highlighted to show the size of the flow from one state to another. The states (pages) on the left are ordered by decreasing size, where 'celebrity recommendation' has the largest outflow and 'product', at the bottom, has the smallest outflow.
## Plotly Sankey
data_trace = dict(
type='sankey',
domain = dict(
x = [0,1],
y = [0,1]
),
orientation = "h",
valueformat = ".0f",
node = dict(
pad = 10,
thickness = 30,
line = dict(
color = "black",
width = 0
),
label = ['', 'customer review', 'home page', 'celebrity recommendation' , 'product' ,'video review', \
'Purchase', 'Null'],
#color = purch_df['Color']
),
link = dict(
source = purch_df['source_#'].dropna(axis = 0, how = 'any'),
target = purch_df['target_#'].dropna(axis=0, how = 'any'),
value = purch_df['count'].dropna(axis = 0, how = 'any')#,
#color = path_count_df['Link Color'].dropna(axis=0, how='any'),
)
)
layout = dict(
title = "Last Page Viewed Before Exit or Purchase",
height = 772,
font = dict(
size = 10
),
)
fig = dict(data = [data_trace], layout = layout)
py.iplot(fig, validate = False)
Networks are graphs made of nodes and edges, and they are present everywhere. Social networks owned by the likes of Facebook and Twitter contain sensitive relationship data; biological networks help analyze patterns in biological systems, such as food webs and predator-prey interactions; and narrative networks help identify key actors and the communities or parties they are involved with. Studying a network is essential to learn about its information spread, its players of influence and its robustness. Networks inherently contain communities: areas of densely connected nodes that provide information about the network and allow for the creation of large-scale maps of it, since individual communities act like meta-nodes. NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
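As a small illustration of that community idea, NetworkX ships community-detection algorithms that group densely connected nodes. The sketch below runs greedy modularity maximisation on a built-in sample graph, not the website data:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Built-in sample graph, used purely to illustrate community detection
G = nx.karate_club_graph()

# Partition the nodes into communities by greedily maximising modularity
communities = greedy_modularity_communities(G)
print(len(communities), 'communities found')
```

Each community is a set of nodes; together they cover the whole graph, which is what lets them be treated as meta-nodes in a larger map.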
The pages on the website and the events associated with each page may be thought of as a network. In this network the nodes are the pages of the site and the edges are the flows of users that access one page from another. This section will visualise the traffic between website pages as a directed network. A directed network has a set of nodes connected by edges, where the edges have a direction associated with them. The network will be iteratively improved, from an initial default (vanilla) plot through further enhancements found by selecting the optimal layout. Finally, the color and thickness of the edges between nodes will be plotted as a function of their respective weights.
## Path count DF
path_count_df = pd.DataFrame.from_dict(trans_count[0], orient = 'index')
path_count_df.reset_index(inplace = True)
path_count_df.columns = ['path', 'count']
path_count_df[['source', 'target']] = path_count_df['path'].str.split('>', expand = True)
path_count_df.drop('path', axis = 1, inplace = True)
path_count_df = path_count_df[['source', 'target', 'count']]
NetworkX can be used to visualise the network of pages, including the entrance ('Start') and the purchase or non-purchase event. Below is a first quick visualisation which will be worked on and iteratively improved to help draw inferences.
## Basic initial plot
import networkx as nx
G0 = nx.from_pandas_edgelist(path_count_df, 'source', 'target', edge_attr = 'count', create_using=nx.DiGraph())
nx.draw(G0, arrows = True, connectionstyle='arc3, rad = 0.1')
Different graph layouts in networkx utilise different graphing algorithms which cause the structure and thus the visualisation of the network to vary. Due to the nature of the data, different algorithms will suit the network more and less. It is therefore important, for interpretability, to ensure the correct layout has been selected.
The section below builds on the basic plot above and introduces the circular, random, spectral, Fruchterman-Reingold, Kamada-Kawai and shell layouts made available by the networkx module. A brief description of each is provided below to gain familiarity with the potential representations.
Circular layout: a style of drawing that places the vertices of a graph on a circle, often evenly spaced so that they form the vertices of a regular polygon.
Random layout: positions nodes uniformly at random in the unit square.
Spectral layout: uses the eigenvectors of a matrix, such as the Laplacian matrix of the graph, as Cartesian coordinates of the graph's vertices.
Fruchterman-Reingold (spring) layout: a force-directed layout algorithm that considers a force between any two nodes; the nodes are represented by steel rings and the edges by springs between them.
Kamada-Kawai layout: places the nodes on the plane based on a physical model of springs, using the Kamada-Kawai path-length cost function.
Shell layout: positions nodes in concentric circles.
The code below creates a large plot with 6 distinct subplots, each containing the network drawn with one of the 6 graph layouts. The best one will be selected and further improved.
## The following code will explore various different types of graph layouts
graph_layouts = [nx.circular_layout(G0), nx.random_layout(G0), \
nx.spectral_layout(G0), nx.fruchterman_reingold_layout(G0), \
nx.kamada_kawai_layout(G0), nx.shell_layout(G0)]
names = ['Circular', 'Random', 'Spectral', 'Fruchterman-Reingold', 'Kamada-Kawai', 'Shell']
#for graph in graph_layouts:
#plt.subplots(figsize=(8, 8))
#nx.draw(G0, with_labels = True, node_size = 700, node_color = "#e1575c", edge_color = '#363847', pos = graph)
fig, axes = plt.subplots(nrows = 3, ncols = 2, figsize = (20, 20))
ax = axes.flatten()
for i, graph, name in zip(range(6), graph_layouts, names):
    ax[i].set_title(f'{name} Layout', size = 14)
    nx.draw(G0, ax = ax[i], with_labels = True, node_size = 700, node_color = "#e1575c", edge_color = '#363847', pos = graph, \
            arrows = True, connectionstyle = 'arc3, rad = 0.5')
    ax[i].set_axis_off()
for ax in axes.flatten():
    ax.set_xticks([])
    ax.set_yticks([])
plt.gca().spines['right'].set_color('none')
plt.gca().spines['top'].set_color('none')
plt.gca().spines['left'].set_color('none')
plt.gca().spines['bottom'].set_color('none')
plt.show()
Upon inspection, the circular layout seems to generalise well and visualises each node and edge to a good standard compared to the other layouts. The spectral and Fruchterman-Reingold algorithms concentrate the main website pages in the middle because of the weights associated with the edges.
The circular layout can still be improved by adding color and most importantly scaling the edges between website pages by the amount of traffic that goes from page A to page B. That way upon quick inspection the user is able to determine which pages lead to other pages.
All the edges leading to the 'Purchase' node are thin in comparison to the others because there aren't many purchases overall. The 'Start' node only has outgoing edges because 'Start' is the page/node where users land and therefore "start" their website journey. On the other hand, the 'Purchase' and 'Null' nodes only have incoming edges because they are the last possible event/page a user may experience. All the other nodes have 6 incoming edges (5 pages + 'Start') and 7 outgoing edges (5 pages + 'Purchase' + 'Null').
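The edge-width scaling used below is a plain min-max normalisation that maps raw traffic counts into a readable width range. As a standalone sketch with made-up counts:

```python
# Min-max scale illustrative edge counts into line widths in [1, 13]
counts = [5, 120, 2400, 310]
lo, hi = min(counts), max(counts)
widths = [((c - lo) / (hi - lo)) * 12 + 1 for c in counts]
print([round(w, 2) for w in widths])
```

The smallest count maps to width 1 and the largest to 13, so even the thinnest edge stays visible while the busiest route dominates the plot.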
## Color and weight plot
u = list(path_count_df['source'])
v = list(path_count_df['target'])
w = list(path_count_df['count'])
G = nx.DiGraph()
for ui, vi, wi in zip(u, v, w):
    G.add_edges_from([(ui, vi)], weight = wi)
#graph_layouts = [nx.circular_layout(G), nx.random_layout(G), nx.spring_layout(G), \
#nx.spectral_layout(G), nx.fruchterman_reingold_layout(G)]
#for graph in graph_layouts:
#pos = graph
pos = nx.circular_layout(G)
edge_labels = dict([((u, v,), d['weight']) for u, v, d in G.edges(data = True)])
weights = [G[u][v]['weight'] for u, v in G.edges()]
weights = list(map(lambda x: (x - min(weights)) /
(max(weights) - min(weights)), weights))
weights = list(map(lambda x: (x * 12) + 1, weights))
fig = plt.figure(figsize = (25, 20))
plt.axis('off')
#nx.draw_networkx_edge_labels(G, pos, edge_labels = edge_labels, label_pos=0.3)
nx.draw_networkx_nodes(G, pos,
                       nodelist = G.nodes(),
                       node_color = 'r',
                       node_size = 800)
nx.draw_networkx_edges(G, pos,
edgelist = G.edges(),
alpha = 0.5, edge_color ='#5cce40', width = weights, arrows = True, connectionstyle = 'arc3, rad = 0.5', \
arrowsize = 20, arrowstyle = '|-|')
nx.draw_networkx_labels(G, pos, font_size = 16, font_color = 'white')
#csfont = {'fontname':'Times New Roman'}
fig.set_facecolor("#262626")
plt.title("Circular Network Representation of the Website's Pages", size = 24, color = 'White')
plt.show()
Collecting data on each step of a user flow allows decision makers to evaluate how users navigate through the sales funnel. By their very nature, funnels shrink at each step as users drop out. The data will indicate where the funnel is 'leaky' (with a large percentage of people dropping out between steps) and might need attention.
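That per-step shrinkage can be quantified with nothing more than the counts of users reaching each step. The sketch below uses hypothetical step names and counts:

```python
# Hypothetical funnel: (step name, users reaching that step)
funnel = [('entrance', 10000), ('product', 4200), ('checkout', 900), ('purchase', 310)]

# Drop-off rate between each consecutive pair of steps
for (step, n), (nxt, m) in zip(funnel, funnel[1:]):
    drop = 1 - m / n
    print(f'{step} -> {nxt}: {drop:.0%} drop off')
```

The step with the largest drop-off rate is the leakiest and is where redesign effort should be focused first.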
To close up the 'leaks', decision makers must consider where they can correct points of pain or friction, where to offer more information, and where to reduce distractions and offer less. This guides users and facilitates decisions and journeys, making the process faster and easier. The project above showed various ways to visualise the journeys of tens of thousands of users. Different methods have different strengths and weaknesses and help a reader understand different aspects of the funnel. Management should now have a clearer idea of how users are interacting with their site. This is the start: from here on, aspects of the website need to be amended in order to achieve results. Again, there are many ways to approach this task; a good, simple and popular one is A/B testing.
A/B testing is the process of comparing two different versions of a site or app against each other to see which one performs better using real world data. A/B testing is a great way to validate hypotheses about changes to the site or app in question. By going through the user flows, identifying opportunities for improvement, and testing different ideas it is possible to continually improve the conversion rates. A/B testing tools offered by Google Tag Manager and Google Optimize make it easy to make changes to your site or app and provides data that show exactly how much of an impact changes will have on your core metrics.
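For completeness, here is a minimal sketch of how such a comparison might be evaluated: a two-proportion z-test on the conversion counts of two variants. All numbers are hypothetical, and in practice a dedicated testing tool (or statsmodels) would be used instead of this hand-rolled version:

```python
from math import sqrt, erf

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical: variant B converts 370/5000 visitors vs. A's 310/5000
z, p = two_proportion_ztest(310, 5000, 370, 5000)
print(f'z = {z:.2f}, p = {p:.3f}')
```

A small p-value suggests the difference between the variants is unlikely to be chance, subject to the usual caveats about sample size and peeking.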
https://setosa.io/ev/markov-chains/
https://en.wikipedia.org/wiki/Markov_chain
https://medium.com/@mortenhegewald/marketing-channel-attribution-using-markov-chains-101-in-python-78fb181ebf1e
https://plotly.com/python/sankey-diagram/
https://developers.google.com/chart/interactive/docs/gallery/sankey
https://en.wikipedia.org/wiki/Sankey_diagram
https://www.analyticsvidhya.com/blog/2018/04/introduction-to-graph-theory-network-analysis-python-codes/
https://maplarge.com/networkgraphs
https://networkx.github.io/